Traditional model-based reinforcement learning (RL) methods generate forward rollout traces with a learned dynamics model to reduce interactions with the real environment. Recent model-based RL methods additionally consider learning a backward model, which specifies the conditional probability of the previous state given the previous action and the current state, to generate backward rollout trajectories. However, in these model-based methods, the samples from backward rollouts and those from forward rollouts are simply aggregated together to optimize the policy via a model-free RL algorithm, which may decrease both sample efficiency and convergence rate. This is because such an approach ignores the fact that backward rollout traces are usually generated starting from some high-value states and are certainly more instructive for the agent to improve its behavior. In this paper, we propose the Backward Imitation and Forward Reinforcement Learning (BIFRL) framework, where the agent treats backward rollout traces as expert demonstrations for the imitation of excellent behaviors, and then collects forward rollout transitions for policy reinforcement. BIFRL thus empowers the agent both to reach and to explore from high-value states in a more efficient manner, and further reduces the number of real interactions, making it more suitable for real-robot learning. Moreover, a value-regularized generative adversarial network is introduced to augment the valuable states that are rarely visited by the agent. Theoretically, we provide the conditions under which BIFRL is superior to baseline methods. Empirically, we demonstrate that BIFRL achieves better sample efficiency and produces competitive asymptotic performance on various MuJoCo locomotion tasks compared with state-of-the-art model-based methods.
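As a rough illustration of the training scheme described above, the following Python skeleton alternates a backward-imitation phase with a forward-reinforcement phase. All component names (`backward_model.rollout`, `policy.imitation_update`, `policy.rl_update`) are hypothetical placeholders consistent with the abstract, not the authors' actual API, and the value-based seeding of backward rollouts is an assumption.

```python
# Hedged sketch of a BIFRL-style loop; every object below (env, policy,
# value_fn, backward_model) is an assumed placeholder, not the paper's code.
import numpy as np

def select_high_value_states(replay_buffer, value_fn, k=32):
    """Pick the k states with the highest estimated value as rollout seeds."""
    states = [s for (s, _, _, _) in replay_buffer]
    values = np.array([value_fn(s) for s in states])
    return [states[i] for i in np.argsort(values)[-k:]]

def bifrl_step(env, policy, value_fn, backward_model, replay_buffer):
    # 1) Backward rollouts seeded at high-value states become "expert" traces.
    demos = []
    for s in select_high_value_states(replay_buffer, value_fn):
        trace = backward_model.rollout(s, horizon=5)  # imagined, reversed
        demos.extend(reversed(trace))                 # replay forward in time

    # 2) Imitation phase: treat the reversed traces as demonstrations.
    policy.imitation_update(demos)

    # 3) Forward phase: collect real transitions and reinforce the policy.
    s = env.reset()
    for _ in range(200):
        a = policy.act(s)
        s2, r, done, _ = env.step(a)
        replay_buffer.append((s, a, r, s2))
        s = env.reset() if done else s2
    policy.rl_update(replay_buffer)
```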
Standard deep learning algorithms are implemented using floating-point real numbers. This presents an obstacle to deploying them on low-end devices, which may lack a dedicated floating-point unit (FPU). Researchers in TinyML have therefore considered machine learning algorithms that can train and run deep neural networks (DNNs) on low-end devices using only integer operations. This paper proposes PocketNN, a lightweight and self-contained proof-of-concept framework in pure C++ for the training and inference of DNNs using only integers. Unlike other approaches, PocketNN operates directly on integers without any explicit quantization algorithms or customized fixed-point formats. This is made possible by pocket activations, a family of activation functions devised for integer-only DNNs, together with an emerging DNN training algorithm called direct feedback alignment (DFA). Unlike standard backpropagation (BP), DFA trains each layer independently, thereby avoiding integer overflow, which is a key problem when using BP with integer-only operations. We used PocketNN to train some DNNs on two well-known datasets, MNIST and Fashion-MNIST. Our experiments show that the DNNs trained with PocketNN achieved 96.98% and 87.7% accuracy on the MNIST and Fashion-MNIST datasets, respectively. These accuracies are very close to those of equivalent DNNs trained using BP with floating-point operations: the accuracy degradations were only 1.02%p and 2.09%p, respectively. Finally, PocketNN offers high compatibility and portability for low-end devices, as it is open source and implemented in pure C++ without any dependencies.
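To make the integer-only DFA idea concrete, here is a minimal NumPy sketch (not PocketNN's C++ code): a two-layer integer network where a fixed random feedback matrix projects the output error directly to the hidden layer, and bit-shifts stand in for floating-point scaling. The clipped `pocket_relu`, the shift amounts, and all sizes are illustrative assumptions, not the paper's actual pocket activations or update rule.

```python
# Integer-only training with direct feedback alignment (DFA), as a toy sketch.
import numpy as np

rng = np.random.default_rng(0)

def pocket_relu(x):
    # A bounded integer ReLU keeps activations in a fixed integer range.
    return np.clip(x, 0, 127)

# Tiny two-layer network; weights, data, and targets are all integers.
W1 = rng.integers(-8, 8, size=(16, 32), dtype=np.int64)
W2 = rng.integers(-8, 8, size=(32, 10), dtype=np.int64)
B1 = rng.integers(-1, 2, size=(10, 32), dtype=np.int64)  # fixed DFA feedback

X = rng.integers(0, 16, size=(64, 16), dtype=np.int64)
Y = np.eye(10, dtype=np.int64)[rng.integers(0, 10, size=64)] * 127

for step in range(50):
    h1 = pocket_relu((X @ W1) >> 4)          # shifts replace float scaling
    out = (h1 @ W2) >> 4
    err = np.clip(out - Y, -128, 127)        # bounded integer output error
    # DFA: the output error is projected through a fixed random matrix B1,
    # so the first layer is updated without backpropagating through W2.
    # This avoids the overflow-prone chained products of standard BP.
    d1 = (err @ B1) * (h1 > 0)
    W2 -= (h1.T @ err) >> 14                 # shifts act as crude, ad-hoc
    W1 -= (X.T @ d1) >> 14                   # integer learning rates
    if step % 10 == 0:
        print(step, np.abs(err).mean())
```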
Class or taxonomy hierarchies are usually constructed manually, and represent a part of our knowledge about the world. In this paper, we propose a novel algorithm for automatically acquiring a class hierarchy from a classifier, which these days is often a large neural network. The information we require from the classifier is its confusion matrix, which contains, for each pair of base classes, the number of errors the classifier makes by mistaking one class for the other. We apply our algorithm to several well-known deep neural network models trained on the CIFAR-10 dataset, to a neural network model that predicts the native language of a non-native English speaker, to a neural network model that detects the language of a written text, and to a classifier that identifies the genre of a piece of music. In the literature, such class hierarchies have been used to provide interpretability for deep neural networks. We also discuss some other potential uses of the acquired hierarchies.
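The abstract leaves the algorithm itself unspecified; one plausible reconstruction is agglomerative clustering over a symmetrized confusion matrix, treating frequently confused classes as similar. The toy matrix, the distance transform, and the average-linkage choice below are all assumptions for illustration, not the paper's exact method.

```python
# Build a class hierarchy from a confusion matrix via hierarchical clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

labels = ["cat", "dog", "car", "truck"]
conf = np.array([[90, 8, 1, 1],     # toy confusion matrix: rows = true class
                 [7, 91, 1, 1],
                 [1, 0, 85, 14],
                 [0, 1, 12, 87]], dtype=float)

sim = conf + conf.T                  # symmetrize the off-diagonal confusions
np.fill_diagonal(sim, 0.0)
dist = 1.0 / (sim + 1.0)             # frequent confusion -> small distance
np.fill_diagonal(dist, 0.0)

tree = linkage(squareform(dist), method="average")
print(dendrogram(tree, labels=labels, no_plot=True)["ivl"])
# Expected grouping: {cat, dog} and {car, truck} merge first.
```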
Unsupervised domain adaptation (UDA) for semantic segmentation is a promising task freeing people from heavy annotation work. However, domain discrepancies in low-level image statistics and high-level contexts compromise the segmentation performance over the target domain. A key idea to tackle this problem is to perform both image-level and feature-level adaptation jointly. Unfortunately, there is a lack of such unified approaches for UDA tasks in the existing literature. This paper proposes a novel UDA pipeline for semantic segmentation that unifies image-level and feature-level adaptation. Concretely, for image-level domain shifts, we propose a global photometric alignment module and a global texture alignment module that align images in the source and target domains in terms of image-level properties. For feature-level domain shifts, we perform global manifold alignment by projecting pixel features from both domains onto the feature manifold of the source domain; and we further regularize category centers in the source domain through a category-oriented triplet loss and perform target domain consistency regularization over augmented target domain images. Experimental results demonstrate that our pipeline significantly outperforms previous methods. In the commonly tested GTA5$\rightarrow$Cityscapes task, our proposed method using Deeplab V3+ as the backbone surpasses previous SOTA by 8%, achieving 58.2% in mIoU.
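Of the components above, the category-oriented triplet loss is the most self-contained; a hedged PyTorch sketch of one plausible form is given below: each sampled pixel feature is pulled toward its own category center and pushed away from the nearest other center. The margin value, the Euclidean distance, and the hardest-negative choice are assumptions, not the paper's exact formulation.

```python
# Sketch of a category-oriented triplet loss over pixel features and centers.
import torch
import torch.nn.functional as F

def category_triplet_loss(feats, labels, centers, margin=0.5):
    """
    feats:   (N, C) pixel features sampled from the source domain
    labels:  (N,)   their category indices
    centers: (K, C) running category centers on the source manifold
    """
    d = torch.cdist(feats, centers)                  # (N, K) distances
    pos = d.gather(1, labels.unsqueeze(1)).squeeze(1)
    d_other = d.scatter(1, labels.unsqueeze(1), float("inf"))
    neg = d_other.min(dim=1).values                  # hardest other center
    return F.relu(pos - neg + margin).mean()

feats = torch.randn(128, 64)
labels = torch.randint(0, 19, (128,))
centers = torch.randn(19, 64)
print(category_triplet_loss(feats, labels, centers))
```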
Compressed videos often exhibit visually annoying artifacts, known as Perceivable Encoding Artifacts (PEAs), which dramatically degrade video visual quality. Subjective and objective measures capable of identifying and quantifying various types of PEAs are critical in improving visual quality. In this paper, we investigate the influence of four spatial PEAs (i.e. blurring, blocking, bleeding, and ringing) and two temporal PEAs (i.e. flickering and floating) on video quality. For spatial artifacts, we propose a visual saliency model with a low computational cost and higher consistency with human visual perception. In terms of temporal artifacts, self-attention based TimeSFormer is improved to detect temporal artifacts. Based on the six types of PEAs, a quality metric called Saliency-Aware Spatio-Temporal Artifacts Measurement (SSTAM) is proposed. Experimental results demonstrate that the proposed method outperforms state-of-the-art metrics. We believe that SSTAM will be beneficial for optimizing video coding techniques.
Image virtual try-on aims at replacing the clothes on a person image with a garment image (in-shop clothes), and has attracted increasing attention from the multimedia and computer vision communities. Prior methods successfully preserve the character of clothing images; however, occlusion remains a pernicious effect for realistic virtual try-on. In this work, we first present a comprehensive analysis of the occlusions and categorize them into two aspects: i) Inherent-Occlusion: the ghost of the former cloth still exists in the try-on image; ii) Acquired-Occlusion: the target cloth is warped onto an unreasonable body part. Based on this in-depth analysis, we find that the occlusions can be simulated by a novel semantically-guided mixup module, which generates semantic-specific occluded images that work together with the try-on images to facilitate training a de-occlusion try-on (DOC-VTON) framework. Specifically, DOC-VTON first conducts sharpened semantic parsing on the try-on person. Aided by semantic guidance and a pose prior, textures of various complexity are selectively blended with human parts in a copy-and-paste manner. Then, a Generative Module (GM) is utilized to synthesize the final try-on image and to learn de-occlusion jointly. In comparison to state-of-the-art methods, DOC-VTON achieves better perceptual quality by reducing occlusion effects.
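As a rough illustration of the copy-and-paste simulation idea, the sketch below pastes a texture onto selected human-part regions of a parsing map. The region selection, the fixed blending weight, and all names are assumptions for illustration; the paper's module is learned and semantic-specific.

```python
# Toy semantically guided occlusion mixup: paste texture onto chosen parts.
import numpy as np

def semantic_mixup(person, texture, parsing, part_ids, alpha=0.8):
    """
    person:   (H, W, 3) try-on image
    texture:  (H, W, 3) occluder texture to paste
    parsing:  (H, W)    semantic part labels of the person
    part_ids: labels of the parts to occlude (e.g., arms)
    """
    mask = np.isin(parsing, part_ids)[..., None]     # (H, W, 1) bool
    blended = alpha * texture + (1 - alpha) * person
    return np.where(mask, blended.astype(person.dtype), person)

person = np.random.randint(0, 256, (256, 192, 3), dtype=np.uint8)
texture = np.random.randint(0, 256, (256, 192, 3), dtype=np.uint8)
parsing = np.random.randint(0, 20, (256, 192))
occluded = semantic_mixup(person, texture, parsing, part_ids=[14, 15])
```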
Panoptic Part Segmentation (PPS) unifies panoptic segmentation and part segmentation into one task. Previous works utilize separate approaches to handle thing, stuff, and part predictions, without shared computation and task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework, named Panoptic-PartFormer. Moreover, we find that the previous metric PartPQ is biased toward PQ. To handle both issues, we make the following contributions: Firstly, we design a meta-architecture that decouples the part feature from the things/stuff feature. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. We term our model Panoptic-PartFormer. Secondly, we propose a new metric, Part-Whole Quality (PWQ), to better measure this task from both pixel-region and part-whole perspectives; it can also decouple the errors of part segmentation and panoptic segmentation. Thirdly, inspired by Mask2Former and based on our meta-architecture, we propose Panoptic-PartFormer++, which introduces a new part-whole cross-attention scheme, realized as a part-whole interaction via masked cross attention, to further boost part segmentation quality. Finally, extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with Panoptic-PartFormer, our Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and 5% PartPQ on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results while cutting GFlops by 70% and parameters by 50%. Our models can serve as a strong baseline and aid future research in PPS. Code will be available.
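The masked cross attention cited from Mask2Former has a standard form, sketched below: each query attends only to pixels inside its current mask prediction. Tensor names and sizes are illustrative, not the paper's code.

```python
# Minimal Mask2Former-style masked cross-attention between queries and pixels.
import torch

def masked_cross_attention(queries, pixel_feats, attn_mask):
    """
    queries:     (B, Q, C) object queries (things, stuff, and parts)
    pixel_feats: (B, N, C) flattened pixel features
    attn_mask:   (B, Q, N) True where a query must NOT attend
    """
    # Never mask an entire row, which would give NaNs after the softmax.
    attn_mask = attn_mask & ~attn_mask.all(dim=-1, keepdim=True)
    scale = queries.shape[-1] ** -0.5
    logits = torch.einsum("bqc,bnc->bqn", queries, pixel_feats) * scale
    logits = logits.masked_fill(attn_mask, float("-inf"))
    attn = logits.softmax(dim=-1)
    return queries + torch.einsum("bqn,bnc->bqc", attn, pixel_feats)

B, Q, N, C = 2, 8, 64, 32
q, feats = torch.randn(B, Q, C), torch.randn(B, N, C)
# In a real decoder the mask comes from the previous layer's mask
# predictions; it is random here purely for a shape check.
mask = torch.rand(B, Q, N) < 0.5
print(masked_cross_attention(q, feats, mask).shape)  # torch.Size([2, 8, 32])
```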
In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they are still limited in extracting local attributes and global identity information, which are critical for the person re-identification task. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules to address the above issue. Specifically, MSTAT consists of three stages that encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving a holistic perception of the input person. We combine the outputs of all the stages for the final identification. In practice, to save computational cost, Spatial-Temporal Aggregation (STA) modules are first adopted in each stage to conduct self-attention operations along the spatial and temporal dimensions separately. We further introduce the Attribute-Aware and Identity-Aware Proxy embedding modules (AAP and IAP) to extract informative and discriminative feature representations at different stages. All of them are realized by employing newly designed self-attention operations with specific meanings. Moreover, temporal patch shuffling is introduced to further improve the robustness of the model. Extensive experimental results demonstrate the effectiveness of the proposed modules in extracting informative and discriminative information from videos, and illustrate that MSTAT achieves state-of-the-art accuracies on various standard benchmarks.
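The separate spatial and temporal self-attention that the STA module performs is a well-known factorization; a hedged PyTorch sketch is below. The dimensions and the use of `nn.MultiheadAttention` are assumptions for illustration, not the authors' implementation.

```python
# Factorized spatial-then-temporal self-attention over video patch tokens.
import torch
import torch.nn as nn

class SpatialTemporalAggregation(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, T, P, C) patch tokens
        B, T, P, C = x.shape
        s = x.reshape(B * T, P, C)             # attention within each frame
        s, _ = self.spatial(s, s, s)
        t = s.reshape(B, T, P, C).permute(0, 2, 1, 3).reshape(B * P, T, C)
        t, _ = self.temporal(t, t, t)          # attention across frames
        return t.reshape(B, P, T, C).permute(0, 2, 1, 3)

x = torch.randn(2, 8, 16, 64)                  # 8 frames, 16 patches each
print(SpatialTemporalAggregation()(x).shape)   # torch.Size([2, 8, 16, 64])
```

The point of the factorization is cost: full joint attention over T*P tokens is quadratic in T*P, while attending over P and T separately is quadratic in each factor alone.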
This paper presents techniques for predicting a user's next intent with a concept knowledge graph. The system has been deployed on the Web at Alipay, serving more than 100 million daily active users. Specifically, we propose AlipayKG to explicitly characterize user intent; it is an offline concept knowledge graph in the Life-Service domain that models the historical behaviors of users, the rich content users interact with, and the relations between them. We further introduce a Transformer-based model that integrates expert rules from the knowledge graph to infer the online user's next intent. Experimental results demonstrate that the proposed system can effectively enhance the performance of downstream tasks while retaining explainability.
As natural language processing (NLP) for gender bias becomes a significant interdisciplinary topic, prevalent data-driven techniques such as large-scale language models suffer from inadequate data and biased corpora, especially for languages with insufficient resources such as Chinese. To this end, we propose CORGI-PM, a Chinese cOrpus foR Gender bIas Probing and Mitigation, which contains 32.9k sentences with high-quality labels derived by following an annotation scheme specifically developed for gender bias in the Chinese context. Moreover, we address three challenges for automatic textual gender bias mitigation, which require models to detect, classify, and mitigate textual gender bias. We also conduct experiments with state-of-the-art language models to provide baselines. To the best of our knowledge, CORGI-PM is the first sentence-level Chinese corpus for gender bias probing and mitigation.